A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data
Abstract
The φ-quantile of an ordered sequence of data values is the element with rank φn, where n is the total number of values. Accurate estimates of quantiles are required for the solution of many practical problems. In this paper, we present a new algorithm for estimating the quantile values for disk-resident data. Our algorithm has the following characteristics: (1) It requires only one pass over the data; (2) It is deterministic; (3) It produces good lower and upper bounds of the true values of the quantiles; (4) It requires no a priori knowledge of the distribution of the data set; (5) It has a scalable parallel formulation; (6) Extra time and memory for computing additional quantiles (beyond the first one) are constant per quantile. We present experimental results on the IBM SP-2. The experimental results show that the algorithm is indeed robust and does not depend on the distribution of the data sets.
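To make the definition and the idea of one-pass lower and upper bounds concrete, here is a minimal Python sketch. It is not the paper's algorithm, which the abstract does not spell out. It assumes a simplified scheme in which the data arrives as equal-sized runs, each run is sorted in memory, and s regularly spaced sample points are kept per run. The names exact_quantile, target_rank and quantile_bounds are hypothetical, as is the choice to round a non-integer rank φn up, and the bounds come from a plain counting argument over the merged samples.

```python
import math


def target_rank(phi, n):
    """1-indexed rank of the phi-quantile; rounding phi*n up is an assumption."""
    return min(n, max(1, math.ceil(phi * n)))


def exact_quantile(values, phi):
    """Reference answer: sort everything and pick the element of rank phi*n."""
    ordered = sorted(values)
    return ordered[target_rank(phi, len(ordered)) - 1]


def quantile_bounds(runs, s, phi):
    """Return (lower, upper) bounds on the phi-quantile after one pass over runs.

    Assumes every run has the same length m and that s divides m.
    Only s sample points per run, plus the global minimum, are retained.
    """
    samples = []
    smallest = None
    step = None  # m // s, the number of elements each sample point stands for
    r = 0        # number of runs seen
    for run in runs:
        ordered = sorted(run)
        if step is None:
            if not ordered or len(ordered) % s != 0:
                raise ValueError("run length must be a positive multiple of s")
            step = len(ordered) // s
        elif len(ordered) != s * step:
            raise ValueError("all runs must have the same length")
        # Keep the largest element of each block of `step` sorted elements.
        samples.extend(ordered[step - 1::step])
        smallest = ordered[0] if smallest is None else min(smallest, ordered[0])
        r += 1
    if r == 0:
        raise ValueError("no data")

    samples.sort()
    n = r * s * step
    t = target_rank(phi, n)

    # Upper bound: the i-th merged sample has at least i*step elements <= it,
    # so the first i with i*step >= t cannot lie below the rank-t element.
    hi_idx = (t + step - 1) // step
    # Lower bound: at most (i-1)*step + r*(step-1) elements are strictly below
    # the i-th merged sample, so pick the largest i keeping that count <= t-1.
    lo_idx = (t - 1 - r * (step - 1)) // step + 1

    lower = samples[lo_idx - 1] if lo_idx >= 1 else smallest
    upper = samples[hi_idx - 1]
    return lower, upper
```

For the three runs [5, 1, 9, 3], [8, 2, 7, 4] and [6, 0, 11, 10] with s = 2 and φ = 0.5, this sketch returns the bounds (4, 6), which bracket the exact value 5. Once the merged sample list exists, each additional quantile costs only two index computations, which mirrors the constant per-quantile overhead the abstract describes.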
Similar resources
Novel Algorithms for Computing Medians and Other Quantiles of Disk-Resident Data
In data warehousing applications, numerous OLAP queries involve the processing of holistic operations such as computing the "top N", median, etc. Efficient implementations of these operations are hard to come by. Several algorithms have been proposed in the literature that estimate various quantiles of disk-resident data. Two such recent algorithms are based on sampling. In this paper we presen...
How to Summarize the Universe: Dynamic Maintenance of Quantiles
Order statistics, i.e., quantiles, are frequently used in databases both at the database server and at the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building user interfaces, and in characterizing the data distribution of evolving datasets in the process of data...
Estimating Quantiles from the Union of Historical and Streaming Data
Modern enterprises generate huge amounts of streaming data, for example, micro-blog feeds, financial data, network monitoring and industrial application monitoring. While Data Stream Management Systems have proven successful in providing support for real-time alerting, many applications, such as network monitoring for intrusion detection and real-time bidding, require complex analytics over his...
Estimating Aggregate Properties on Probabilistic Streams
The probabilistic-stream model was introduced by Jayram et al. [16]. It is a generalization of the data stream model that is suited to handling "probabilistic" data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over potentially a very large number of classical "deterministic" streams...
Publication date: 1997